Course overview

Author
Affiliation

Dave Clark

Binghamton University

Published

January 19, 2025

About this class

All models are wrong \(\ldots\)

“All models are wrong, some are useful.” Box & Draper (1987)

This course is based on the principle that to build useful models, we need:

  • an intuitive understanding of regression models
  • data science skills
  • careful and creative thinking about politics
  • a general skepticism of individual models, but an optimism about the modeling enterprise

The regression model

This course focuses on the linear regression model, but with a heavy emphasis on data science skills like data management/wrangling, and coding.

Course Goals

In May, you should be able to:

  • develop and test hypotheses about politics.

  • estimate, evaluate, and interpret basic linear and nonlinear regression models.

  • evaluate, understand, clean, wrangle, join, and reshape data.

  • write R code, to manage data, implement simulation methods, estimate models, visualize data/predictions.

  • understand the data generation process as a motivation for modeling.

This class

Is not a math class,

but one that uses basic math toward an applied goal.

This class

  • is aimed at application; the more you struggle with code, the more you’ll learn.

  • the more you use other people’s code, the less you’ll learn.

  • the more you look at other people’s code, the more you’ll learn. You should live on stackoverflow etc.

  • is applied - if you apply the tools we cover, you’ll learn a lot; if not, you won’t.

  • the students who do best later on in this program challenge themselves here.

This class

  • is a collaborative effort among you, and with me and Oguzhan. We learn from each other when we collaborate.

  • This means having honest conversations in class about what we do and do not understand.

  • It means working together to learn R, to wrangle data, and to understand models; working together on your assignments.

  • the most successful cohorts are those that work collaboratively; this does not mean free-riding, but struggling with everything together.

What do you need to know for 501?

What do you need at the start?

Matrix algebra

You should be able to make sense of this in terms of dimensions and operations:

\[\beta = (X'X)^{-1} X'y\] where \(X\) is a data matrix of rows and columns; these are the variables in a regression. \(y\) is the outcome variable, the same number of rows as \(X\). Define the dimensions of \(X\), \(y\), and \(\beta\), of \((X'X)^{-1}\) and \(X'y\).

Probability theory

  • what are probability distributions?
  • what is the PDF of a variable?
  • what is the CDF of a variable?

Understanding Regression models

The class focuses on regression models of the general form (scalar notation):

\[ y_i = \beta_0 + X_1\beta_1 + X_2\beta_2 \ldots + X_k\beta_k + \epsilon\]

or in matrix notation,

\[ \mathbf{y} = \mathbf{X} \mathbf{\beta} + \mathbf{\epsilon} \]

An outcome, \(y\) is a function of some combination of \(X\) variables, their coefficients (\(\beta\), including an intercept), and an error term.

Regression models

Posit a statistical relationship between:

  • an outcome variable, \(y\).
  • a covariate of interest, \(x\).
  • a set of controls, \(\mathbf{X}\).

The effect of \(x\) is denoted \(\beta\). It is the partial effect of x on y, because removed from that effect are the effects of x -> X -> y. This is what we mean by controlling for the effects of X. Much more on this later.

Regression models

Can be estimated using different technologies including:

  • Least Squares
  • Maximum Likelihood
  • Bayesian methods

This course is about the technology of least squares; we will deal with one ML regression model (logit) later in the semester.

Illustration

Models to understand politics

\(~\)

This a methods class, but the methods are only meaningful to us insofar as they help us understand politics. So let’s motivate the class with a question about politics.

Careful thinking

Correlation does not imply causation.


Repression during the COVID-19 Pandemic

Among the products of the pandemic is an opportunity for governments, under the guise of protecting public health, to repress citizens.

There is some evidence of democratic backsliding prior to the pandemic, but the global public health crisis gave governments a specific and universal opportunity to take advantage of this softened ground and to become more authoritarian.

We’re going to examine the covariates of violence against civilians during the COVID period, focusing on how states of emergency create changes in government violence.


Data

Let’s look at https://acleddata.com, and data on states of emergency during the COVID period from @lundgren2020emergency.

The next two figures plots protests and repression since mid-2019, the onset of the pandemic marked at March 15, 2020.

The green lines are local regressions showing trends; while protests return to pre-pandemic levels (similar to the years prior), repression increases since the onset of COVID-19.


Protests

code
acled<-read.csv("/Users/dave/Documents/2013/PITF/analysis/protests2021/data/2022analysis/acled2022.csv")

##Event_date variable has a "Chr" format. 
#Need to convert it into date format to calculate 7 day averages for Protests and Violence against Civilians

acled$event_date<-as.Date(acled$event_date,format="%d %B %Y")
#class(acled$event_date)

##Compute number of events per day

acled$perday<-1

protests<-filter(acled,event_type=="Protests" )
stateviol<-filter(acled, event_type=="Violence against civilians")
protests<-aggregate(perday ~ event_date, data = protests, sum)
stateviol<-aggregate(perday ~ event_date, data = stateviol, sum)


##Calculating a 7 day moving average
protests$rollmean<-rollmean(protests$perday,k=7, fill=NA)
stateviol$rollmean<-rollmean(stateviol$perday,k=7, fill=NA)

protests <- protests %>% filter(protests$event_date < "2022-06-25")
stateviol <- stateviol %>% filter(stateviol$event_date < "2022-06-25")

# protestplot <- ggplot(data=protests, aes(x=event_date, y=rollmean)) +
#   geom_line()+
#   geom_smooth() +
#   geom_vline(xintercept=as.numeric(as.Date("2020-03-15")), color="red") +
#   scale_x_date(date_breaks = "6 months", date_labels = '%b %Y')  +
#   labs(x="Date", y="Protests", caption="7 day rolling averages; smoothed loess trend lines; red lines indicate 3.15.2020 COVID onset; 'ACLED data retrieved 7.18.22, http://acleddata.com") + 
#   ggtitle("Protests During the Pandemic") 
#    
# 
# protestplot


library(highcharter)
library(dplyr)

# First handle potential NA values and ensure data is properly ordered
protests <- protests %>%
  arrange(event_date) %>%
  # Optionally remove NA values if present
  filter(!is.na(rollmean), !is.na(event_date))

# Calculate the smoothed trend line with explicit handling of the data points
x_vals <- as.numeric(protests$event_date)
smooth_fit <- loess(rollmean ~ x_vals, data = protests, span = 0.75)
protests$smooth_trend <- predict(smooth_fit, x_vals)

bucolors<-list("#005A43","#6CC24A", "#A7DA92", "#BDBEBD", "#000000" )

# Create the highchart
hchart <- highchart() %>%
  hc_add_series(
    data = protests,
    type = "line",
    hcaes(x = event_date, y = rollmean),
    color= "#000000",
    name = "Rolling Average"
  ) %>%
  hc_add_series(
    data = protests,
    type = "line",
    hcaes(x = event_date, y = smooth_trend),
    name = "Trend",
    dashStyle = "Solid",
    color = "#005A43",
    opacity = 0.7
  ) %>%
  hc_xAxis(
    plotLines = list(list(
      value = as.numeric(as.POSIXct("2020-03-15")) * 1000,
      color = "red",
      width = 2,
      zIndex = 3
    )),
    type = "datetime",
    dateTimeLabelFormats = list(month = "%b %Y"),
    tickInterval = 6 * 30 * 24 * 3600 * 1000
  ) %>%
  hc_yAxis(
    title = list(text = "Protests")
  ) %>%
  hc_title(
    text = "Protests During the Pandemic"
  ) %>%
  hc_caption(
    text = "7 day rolling averages; smoothed loess trend lines; red line indicates 3.15.2020 COVID onset; ACLED data retrieved 7.18.22",
    align = "left",
    style = list(fontStyle = "italic")
  ) %>%
  hc_tooltip(
    shared = TRUE,
    crosshairs = TRUE
  ) %>%
  hc_credits(enabled = FALSE)

# Display the chart
hchart

Repression

code
# repressionplot <- ggplot(data=stateviol, aes(x=event_date, y=rollmean)) +
#   geom_line()+
#   geom_smooth() +
#   geom_vline(xintercept=as.numeric(as.Date("2020-03-15")), color="red") +
#   scale_x_date(date_breaks = "6 months", date_labels = '%b %Y')  +
#   labs(x="Date", y="Violence against civilians", caption="7 day rolling averages; smoothed loess trend lines; red lines indicate 3.15.2020 COVID onset; ACLED data retrieved 7.18.22, http://acleddata.com") + 
#   ggtitle("Violence Against Civilians During the Pandemic") 
#   
# repressionplot

# First handle potential NA values and ensure data is properly ordered
stateviol <- stateviol %>%
  arrange(event_date) %>%
  # Optionally remove NA values if present
  filter(!is.na(rollmean), !is.na(event_date))

# Calculate the smoothed trend line with explicit handling of the data points
x_vals <- as.numeric(stateviol$event_date)
smooth_fit <- loess(rollmean ~ x_vals, data = stateviol, span = 0.75)
stateviol$smooth_trend <- predict(smooth_fit, x_vals)

bucolors<-list("#005A43","#6CC24A", "#A7DA92", "#BDBEBD", "#000000" )

# Create the highchart
hchart <- highchart() %>%
  hc_add_series(
    data = stateviol,
    type = "line",
    hcaes(x = event_date, y = rollmean),
    color= "#000000",
    name = "Rolling Average"
  ) %>%
  hc_add_series(
    data = stateviol,
    type = "line",
    hcaes(x = event_date, y = smooth_trend),
    name = "Trend",
    dashStyle = "Solid",
    color = "#005A43",
    opacity = 0.7
  ) %>%
  hc_xAxis(
    plotLines = list(list(
      value = as.numeric(as.POSIXct("2020-03-15")) * 1000,
      color = "red",
      width = 2,
      zIndex = 3
    )),
    type = "datetime",
    dateTimeLabelFormats = list(month = "%b %Y"),
    tickInterval = 6 * 30 * 24 * 3600 * 1000
  ) %>%
  hc_yAxis(
    title = list(text = "Violence Against Civilians")
  ) %>%
  hc_title(
    text = "Violence Against Civilians During the Pandemic"
  ) %>%
  hc_caption(
    text = "7 day rolling averages; smoothed loess trend lines; red line indicates 3.15.2020 COVID onset; ACLED data retrieved 7.18.22",
    align = "left",
    style = list(fontStyle = "italic")
  ) %>%
  hc_tooltip(
    shared = TRUE,
    crosshairs = TRUE
  ) %>%
  hc_credits(enabled = FALSE)

# Display the chart
hchart

States of Emergency during the Pandemic

Here are the onsets of states of emergency declared during the pandemic (from @lundgren2020emergency).

code
####################
#analysis data
####################

#make analysis data frame
analysisdata <- left_join(perday, soe, by=c("ccode"="ccode", "event_date"="event_date"))
analysisdata <- left_join(analysisdata, covid, by=c("ccode"="ccode", "event_date"="date"))

#zeros for NA in soe_s
analysisdata <- analysisdata %>% mutate(across(soe_s, ~replace_na(., 0)))

#fill soe series after onset
analysisdata <- analysisdata %>% group_by(ccode) %>% mutate(cumSOE = cumsum(soe_s)) %>% ungroup

#plot cumulative states of emergency over time
totalsoe <- analysisdata %>% group_by(event_date) %>%
  summarise(across(cumSOE, sum)) %>%
  ungroup() 

totalsoe$event_date <- as.Date(totalsoe$event_date)

totalsoe <- totalsoe %>% filter(event_date >"2020-01-01"& event_date <"2020-07-01")

ggplot() +
  geom_line(data=totalsoe, aes(x=event_date, y=cumSOE))+
  scale_x_date(date_breaks = "1 months", date_labels = '%b %Y') +
  labs ( colour = NULL, x = "Date", y =  "Active States of Emergency" )  


Modeling repression

Let’s build a simple model predicting violence against civilians. The data here are daily, covering 167 countries between January 2020 and July 2022. We’ll predict the 7 day moving average of repression depending on whether a state of emergency is in place.

\(~\)

My expectation is states of emergency will make repression easier to motivate and therefore, more frequent.


The model

\(y\) = violence against civilians

\(x\) = states of emergency

Controlling for:

  • new daily deaths from COVID (logged)
  • percentage of the population over 65 years old
  • the Human Development Index
  • the 7 day average number of protests.

Modeling Repression and SoEs

This is a linear regression model estimated using OLS.

code
##################
#models
#################

analysisdata$soeprotests <- analysisdata$cumSOE*analysisdata$protest7day
analysisdata$lnd =  log(analysisdata$new_deaths+.01)
#label variables for coef plot
library("labelled")
analysisdata <- analysisdata %>%
  set_variable_labels(
    cumSOE = "State of Emergency",
    lnd = "ln(New COVID deaths)",
    aged_65_older = "% age 65 +",
    human_development_index = "Human Development Index",
    protest7day = "Protests, 7 day avg",
    soeprotests = "SoE * Protests, 7 day avg", 
    `Violence against civilians` = "Violence against Civilians"
  )

#model predicting repression during covid 

m1 <- lm(data=analysisdata, `Violence against civilians` ~ cumSOE  + lnd + aged_65_older+human_development_index + protest7day )
#summary(m1)  

library(stargazer)

stargazer(m1, type = "html", 
          title = "Predictors of Violence Against Civilians",
          dep.var.labels = c("Violence against Civilians"),
          covariate.labels = c("Intercept", "State of Emergency", "ln(New COVID deaths)", "% age 65 +", "Human Development Index", "Protests, 7 day avg"),
          omit.stat = c("f", "ser"),
          single.row = TRUE,
          intercept.bottom = FALSE,
          digits.extra = 0,
          star.char = c("*", "**", "***"),
          star.cutoffs = c(0.05, 0.01, 0.001),
          notes.append = FALSE,
          notes.label = "",
          notes.align = "l",
          no.space=TRUE
)
Predictors of Violence Against Civilians
Dependent variable:
Violence against Civilians
Intercept 0.930*** (0.025)
State of Emergency 0.256*** (0.008)
ln(New COVID deaths) 0.015*** (0.002)
% age 65 + -0.024*** (0.001)
Human Development Index -0.685*** (0.041)
Protests, 7 day avg 0.043*** (0.001)
Observations 165,211
R2 0.057
Adjusted R2 0.057
p<0.05; p<0.01; p<0.001

Another look

Tables of estimates are one way to present results; here’s an alternative using a coefficient plot.

code
#coefficient plot
ggcoef_model(
  m1,
  show_p_values = FALSE,
  signif_stars = FALSE, 
  exponentiate =FALSE, 
  intercept=TRUE,
  stripped_rows = FALSE,
  colour=NULL, 
  include = c("cumSOE", "lnd", "aged_65_older", "human_development_index", "protest7day")
)+
  xlab("Estimates and Confidence Intervals") +
  ggtitle("Predictors of Violence Against Civilians") 


Discussion

The estimates indicate:

  • if all the \(X\) variables are set to zero, the expected violence against civilians is 0.93 (the intercept). Can we reasonably set all the \(X\) variables to zero?

  • the difference between a state with a SoE and without is different from zero, but quite small; 0.26 uses of violence over a 7 day period.

  • more deaths from COVID are associated with more violence against civilians. The coefficient is 0.015, meaning that for every increase of 100 deaths, we expect a 1.5% increase in violence against civilians. This is the largest (perhaps only) interesting result in the model.


Quantities of interest (simulation)

Let’s generate predictions from the model by simulating the distribution of \(\widehat{\beta}\):

  • sample 1000 times from a multivariate normal distribution with mean \(\widehat{\beta}\)s and variances of \(var(\widehat{\beta})\).

  • plug those simulated \(\widehat{\beta}\)s into the equation using mean/median/mode values for the \(x\)s.

  • save the appropriate predictions - 2.5%, 50%, and 97.5% quantiles of the distribution of \(\widehat{y}\).


Simulating quantities

First, set \(SoE = 0\); vary \(covid~deaths = ln(1 \ldots 3000)\), \(\%~age~ 65 = 6\) (the median), \(HDI = .74\) (the median), and \(protests = 2\) (the median).

\[ \widehat{y} = \beta_0 + \beta_1* 0 +\beta_2*log(p) + \beta_3*6 +\beta_4*.74+ \beta_5*2 \]

then, repeat but with \(SoE = 1\):

\[ \widehat{y} = \beta_0 + \beta_1*1 +\beta_2*log(p) + \beta_3*6 +\beta_4*.74+ \beta_5*2 \]


Predicted Violence against Civilians

code
#simulated quantities
sigma <- vcov(m1)
B <- data.frame(rmvnorm(n=1000, mean=coef(m1), vcov(m1)))

colnames(B) <-c('b0', 'b1', 'b2', 'b3', 'b4', 'b5')

predictions <- data.frame ( lb0 = numeric(0),med0= numeric(0),
      ub0= numeric(0),lb1= numeric(0),med1= numeric(0),ub1= 
      numeric(0), deaths = numeric(0))

for (p in seq(0,3000,100)) {
  
  xbRA0  <- quantile(B$b0 + B$b1*0 + B$b2*log(p) + B$b3*6 +B$b4*.74+ 
                       B$b5*2, probs=c(.025,.5,.975))
  xbRA1 <- quantile(B$b0 + B$b1*1 + B$b2*log(p) + B$b3*6 +B$b4*.74+ 
                      B$b5*2, probs=c(.025,.5,.975))
  xbRA0<- data.frame(t(xbRA0))
  xbRA1<- data.frame(t(xbRA1))
  predictions[p:p,] <- data.frame(xbRA0, xbRA1, p)
}
predictions$lndeaths <- log(predictions$deaths)

#plot 
ggplot()+
  geom_ribbon(data=predictions, aes(x=deaths, ymin=lb0, ymax=ub0),fill = "grey70", alpha = .4, ) +
  geom_ribbon(data=predictions, aes(x=deaths, ymin=lb1, ymax=ub1), fill= "grey60",  alpha = .4, ) +
  geom_line(data= predictions, aes(x=deaths, y=med0))+
  geom_line(data= predictions, aes(x=deaths, y=med1))+
  labs ( colour = NULL, x = "New COVID Deaths", y =  "Expected Repression Episodes" ) +
  annotate("text", x = 500, y = .82, label = "State of Emergency", size=3.5, colour="gray30")+
  annotate("text", x = 1000, y = .55, label = "No State of Emergency", size=3.5, colour="gray30")


Discussion

  • SoE states have higher levels of violence against civilians, but the difference is very small, making me think we haven’t found anything particularly interesting here.

  • Covid deaths are associated with more violence; the effect is small over the range plotted here. Recall, the actual range is 0-28,000 in the model, driven by a couple of extreme outliers. I’m only plotting out to 3000 deaths in the predictions.


This model mainly tells us there’s a difference between SoE and non-SoE states - that difference in terms of repression events over a seven day period is relatively small. Moreover, it’s possible (even likely) repression leads to states of emergency - it’s also likely repression leads to protests in some cases - in other words, the causal arrow likely flows the opposite direction.

This probably is not a terribly useful model, but it serves the purposes of showing us the basic modeling enterprise, and points to some interesting questions we’ll know how to model by the end of the semester. Specifically, we’ll know how to model causal effects that might run both directions - doing so may well produce a more useful model. In any event, that model will still be wrong.

Our goal this semester is to learn to build useful models, and to understand the data generation process that underlies those models.

Back to top